Summary

This document organises and prepares a literature review of synthetic data generation methods in sports science, and examines the practical use cases in which these methods are applied.

Aim

  1. To collect, clean, and combine reference data from several research databases, and prepare a summary for review and visualisation.
  2. To organise the synthetic sports datasets based on their primary use cases, and summarise their dataset characteristics, methodological approach, and intended analytical applications.

Approach

Aim 1

The review integrates references from five major databases:

  • IEEE
  • QUT library
  • Science Direct
  • Springer Nature
  • Web of Science

All .ris and .csv files exported from these databases were imported and combined into a single structure with three fields: title, year, and abstract.

Duplicate records were identified based on titles and removed, keeping only the first occurrence to ensure one unique record per publication.

The deduplicated dataset (n = 224) was screened with the revtools::screen_abstracts() Shiny interface, and studies were selected manually on the basis of their abstracts (n = 28).

After the methodology sections were reviewed in NVivo, 15 files were selected.

  • Bibliographic data from 15 files were combined into one table.
  • An infographic was produced showing research methods and their categories.

Aim 2

Each dataset was assigned to one of the following categories:

  • Movement: pose estimation, motion tracking, biomechanical analysis.
  • Tactical: formation recognition, event detection, game-situation analysis.
  • Performance: fatigue prediction, load and performance monitoring.
  • Player: player identity, jersey recognition, technical skill prediction.
  • Injury: impact simulation, risk modelling, unsafe event reconstruction.

Results

  1. Synthetic data in sports science is mainly generated using GAN-based, simulation-based, or statistical approaches; the table below compares the three.
     Criterion             GAN-based                         Simulation-based              Statistical
     Realism               ⭐⭐⭐⭐                          ⭐⭐⭐⭐                      ⭐⭐
     Privacy               ⭐⭐⭐                            ⭐⭐⭐⭐⭐                    ⭐⭐⭐⭐
     Data variety          ⭐⭐⭐⭐                          ⭐⭐⭐⭐                      ⭐⭐
     Technical complexity  High                              Very high                     Medium
     Best for              Time-series, video, or image data Physical / injury simulation  Variable-level control
  2. Synthetic datasets are distributed across the following five use-case categories.

  • Movement Analysis

     Dataset      Sport       Method      Use Case
     SwimXYZ      Swimming    GAN/NN      Underwater motion and pose estimation.
     NBA2K        Basketball  GAN/NN      3D reconstruction and player pose tracking.
     SoccerSynth  Soccer      Simulation  Player and ball detection under varied visual conditions.
     SMDRD        Running     Simulation  Evaluation of monocular 3D pose estimation from broadcast-style video.

  • Tactical Analysis

     Dataset  Sport                Method      Use Case
     SoccER   Soccer               Simulation  Recognition of passes, shots, fouls, tackles, and possession events.
     SAFFD    American Football    GAN/NN      Offensive formation recognition from synthetic formations.
     SHBGSD   Handball/Basketball  GAN/NN      Classification of game situations such as attacks, counters, and penalties.

  • Performance Monitoring

     Dataset  Sport             Method       Use Case
     SAMD     Endurance Sports  GAN/NN       Fatigue prediction and training response modelling.
     GFPD     Gaelic Football   GAN/NN       Performance attenuation and physiological pattern generation.
     SCTS     Cycling           GAN/NN       Modelling cycling session intensity and load progression.
     GTSRD    Rugby             Statistical  Synthetic load and recovery monitoring for privacy-preserving analysis.
     SPDAM    Football          Statistical  Performance readiness and training load modelling.

  • Player Modelling

     Dataset   Sport              Method       Use Case
     SC2DSJND  American Football  GAN/NN       Jersey number recognition for improved player identification.
     GFPAD     Soccer             Statistical  Player attribute expansion to support skill prediction and classification.

  • Injury Analysis

     Dataset  Sport              Method      Use Case
     GSHIKD   American Football  Simulation  Head impact classification modelling using sensor data.
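
To make the "Variable-level control" strength of the statistical approach concrete, a statistical generator can be as simple as fitting a distribution to an observed variable and resampling from it. The sketch below is illustrative only; all variable names and values are hypothetical and not taken from any of the reviewed datasets.

```r
# A minimal sketch of the statistical approach: fit a simple distribution to
# an observed variable, then sample synthetic values that preserve its
# summary statistics without exposing any athlete's real records.
# All names and numbers here are hypothetical.
set.seed(42)

observed_load <- rnorm(100, mean = 300, sd = 40)  # stand-in for real session loads

# Fit by moment matching, then draw a synthetic series of the same length
fit_mean <- mean(observed_load)
fit_sd   <- sd(observed_load)
synthetic_load <- rnorm(length(observed_load), mean = fit_mean, sd = fit_sd)

# The synthetic series mimics the distribution, not individual observations
summary(synthetic_load)
```

Because the generator only stores two fitted parameters per variable, this is also the easiest approach to audit for privacy, which is consistent with the high privacy rating in the comparison table.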

Data Preparation

# Packages used throughout this report
library(revtools)    # read_bibliography(), screen_abstracts()
library(qdap)        # strip()
library(readxl)      # read_excel()
library(dplyr)
library(stringr)     # str_wrap()
library(ggplot2)
library(ggalluvial)  # geom_alluvium(), geom_stratum()
library(plotly)      # plot_ly()

list.files("data/literature/ris/")
## [1] "ieee.ris"           "qut.ris"            "scienceDirect.ris" 
## [4] "springerNature.csv" "webofScience.ris"   "webofScience_.ris"
# The Web of Science export contains blank lines that break RIS parsing;
# remove them and write a cleaned copy
wos <- readLines("data/literature/ris/webofScience.ris")
wos <- wos[wos != ""]
writeLines(wos, "data/literature/ris/webofScience_.ris")
# Load files
ieee <- read_bibliography("data/literature/ris/ieee.ris")
qut <- read_bibliography("data/literature/ris/qut.ris")
scienceDirect <- read_bibliography("data/literature/ris/scienceDirect.ris")
springerNature <- read_bibliography("data/literature/ris/springerNature.csv")
webofScience <- read_bibliography("data/literature/ris/webofScience_.ris")
head(ieee[,1:5])
head(ieee[,1:5])
# Rename columns in springerNature
names(springerNature)[names(springerNature) == "item_title"] <- "title"
names(springerNature)[names(springerNature) == "publication_year"] <- "year"

# Add the missing abstract column, filled with NA
springerNature$abstract <- NA

colnames(springerNature)
##  [1] "title"             "publication_title" "book_series_title"
##  [4] "journal_volume"    "journal_issue"     "item_doi"         
##  [7] "author"            "year"              "URL"              
## [10] "content_type"      "abstract"
# Merge files: keep only title, year, abstract from each source
merge_sources <- function(...) {  # named to avoid masking base::merge()
  sources <- list(...)
  
  data <- lapply(sources, function(df) {
    for (col in c("title", "year", "abstract")) {
      if (!col %in% names(df)) {
        df[[col]] <- NA
      }
    }
    df[, c("title", "year", "abstract")]
  })
  
  combined <- do.call(rbind, data)
  row.names(combined) <- NULL
  combined
}

bibliography <- merge_sources(ieee, qut, scienceDirect, springerNature, webofScience)

dim(bibliography)
## [1] 258   3
# Title preparation: lower-case and strip punctuation for reliable matching
bibliography$titleLower <- tolower(bibliography$title)
bibliography$titleLower <- strip(bibliography$titleLower, apostrophe.remove = TRUE)  # qdap::strip
# Check for duplicates
unique(bibliography$titleLower[duplicated(bibliography$titleLower)])
##  [1] "silhouettebased d human pose estimation using a single wristmounted camera"
##  [2] "calibrate interactive analysis of probabilistic model output"                                                                                                   
##  [3] "patchemg fewshot emg signal generation with diffusion models for data augmentation to improve classification performance"                                       
##  [4] "kganbased semisupervised domain adapted human activity recognition"                                                                                             
##  [5] "task scheduling in threedimensional spatial crowdsourcing a social welfare perspective"                                                                         
##  [6] "improving pagerank using sports results modeling"                                                                                                               
##  [7] "soccer computer graphics meets sports analytics for soccer event recognition"                                                                                   
##  [8] "play by play a dataset of handball and basketball game situations in a standardized space"
##  [9] "machine learningbased smart irrigation controller for runoff minimization in turfgrass irrigation"                                                              
## [10] "advancements in basketball action recognition datasets methods explainability and synthetic data applications"                                                  
## [11] "improvement of accuracy of underperforming classifier in decision making using discrete memoryless channel model and particle swarm optimization"               
## [12] "synthetic data for sharing and exploration in highperformance sport considerations for application"                                                             
## [13] "book of abstracts esmrmb online th annual scientific meeting october"                                                                                           
## [14] "ecr book of abstracts"                                                                                                                                          
## [15] "synthetic data as a strategy to resolve data privacy and confidentiality concerns in the sport sciences practical examples and an r shiny application"          
## [16] "a synthetic datadriven machine learning approach for athlete performance attenuation prediction"                                                                
## [17] "understanding accelerationbased load metrics from concepts to implementation"                                                                                   
## [18] "implementing multiple imputation for missing data in longitudinal studies when models are not feasible an example using the random hot deck approach"           
## [19] "crossmodal selfattention mechanism for controlling robot volleyball motion"                                                                                     
## [20] "artificial intelligence in the selection of topperforming athletes for team sports a proofofconcept predictive modeling study"                                  
## [21] "jersey number detection using synthetic data in a lowdata regime"                                                                                               
## [22] "a novel explainable artificial intelligence framework using knockoffs techniques with applications to sports analytics"                                         
## [23] "analysis of player tracking data extracted from football match feed"                                                                                            
## [24] "semantic representation and comparative analysis of physical activity sensor observations using mox sensor in real and synthetic datasets a proofofconceptstudy"
## [25] "d human pose data augmentation using generative adversarial networks for roboticassisted movement quality assessment"                                           
## [26] "dynamic ranking with the btl model a nearest neighbor based rank centrality method"
# Remove duplicated titles, keeping the first unique entry
bibliography_ <- bibliography[!duplicated(bibliography$titleLower), ]

# Check that duplicates are gone
any(duplicated(bibliography_$titleLower))
## [1] FALSE
# Use shiny app to filter based on Abstract
# screen_abstracts(bibliography_)

Bibliography

The 22 files were downloaded and imported into NVivo, where they were manually coded and a coding matrix was created.

# Match the number of selected references (28) against the number of downloaded PDFs
files <- list.files("data/literature/reference/", pattern = "\\.pdf$", full.names = TRUE)
cat("Total PDF files found:", length(files), "\n")
## Total PDF files found: 14
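
When the counts of screened references and downloaded PDFs disagree, a rough slug comparison between titles and filenames can show which papers are missing. The sketch below uses hypothetical stand-ins for bibliography_$title and the files vector, and assumes PDFs are named after paper titles, which is a local naming convention rather than part of the pipeline above.

```r
# Sketch: flag screened titles with no matching PDF on disk.
# pdf_files and screened_titles are hypothetical stand-ins for the real
# list.files() result and bibliography_$title.
slug <- function(x) gsub("[^a-z0-9]+", "", tolower(x))

pdf_files <- c("jersey number detection using synthetic data.pdf",
               "soccer computer graphics meets sports analytics.pdf")
screened_titles <- c("Jersey Number Detection Using Synthetic Data",
                     "SoccER: Computer Graphics Meets Sports Analytics",
                     "A Paper With No Downloaded PDF")

pdf_slugs   <- slug(sub("\\.pdf$", "", pdf_files))
title_slugs <- slug(screened_titles)

# A title is covered if its slug and a PDF slug contain each other
covered <- vapply(title_slugs, function(t) {
  any(grepl(t, pdf_slugs, fixed = TRUE)) ||
    any(vapply(pdf_slugs, function(p) grepl(p, t, fixed = TRUE), logical(1)))
}, logical(1))

screened_titles[!covered]  # titles still needing a PDF
```

Slug matching in both directions tolerates subtitles and punctuation differences, but exact de-duplicated title matching would be safer if the filenames are controlled.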

Methods

infographic <- read_excel("data/literature/literatureReview/literatureReview.xlsx", 
                          sheet = "summary")
colnames(infographic)
##  [1] "file"              "aim.study"         "sport.domain"     
##  [4] "methods.model"     "group"             "group.description"
##  [7] "examples"          "dataset.type"      "dataset"          
## [10] "dataset.name"      "data.access"       "data.type"        
## [13] "description"       "use.case.type"     "use.case"         
## [16] "aim.dataset"
# Wrap text to multiple lines for readability
wrap_text <- function(x, width = 25) str_wrap(x, width = width)

infographic <- infographic %>%
  mutate(
    methods.model = wrap_text(methods.model, 20),
    group = wrap_text(group, 25),
    group.description = wrap_text(group.description, 30),
    examples = wrap_text(examples, 30)
  )

infographic
# Plot
ggplot(infographic,
       aes(axis1 = methods.model,
           axis2 = group,
           axis3 = group.description,
           axis4 = examples)) +
  geom_alluvium(aes(fill = group), width = 0.05, alpha = 0.35) +  
  geom_stratum(width = 0.5, fill = "grey95", color = "grey70") +
  geom_text(stat = "stratum",
            aes(label = after_stat(stratum)),
            size = 3,
            lineheight = 0.9,
            color = "black") +
  scale_x_discrete(limits = c("Methods", "Group", "Description", "Examples"),
                   expand = c(.05, .05)) +
  theme_minimal(base_size = 15) +
  theme(
    axis.title = element_blank(),
    axis.text.y = element_blank(),
    panel.grid = element_blank(),
    legend.position = "none",               
    plot.margin = margin(10, 30, 10, 30), 
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  ggtitle("Methodological Summary in Synthetic Data Generation for Sports Science")

Use Cases

# Re-wrap text columns with wider widths for the treemap labels
infographic <- infographic %>%
  mutate(
    methods.model = wrap_text(methods.model, 100),
    description = wrap_text(description, 200),
    use.case = wrap_text(use.case, 100),
    aim.dataset = wrap_text(aim.dataset, 200)
  )

infographic
# Level 1: dataset.type
lvl1 <- infographic %>%
  distinct(dataset.type) %>%
  mutate(
    ids    = dataset.type,
    labels = dataset.type,
    parents = ""
  )

# Level 2: use.case.type
lvl2 <- infographic %>%
  distinct(dataset.type, use.case.type) %>%
  mutate(
    ids    = paste(dataset.type, use.case.type, sep = "-"),
    labels = use.case.type,
    parents = dataset.type
  )

# Level 3: sport.domain
lvl3 <- infographic %>%
  distinct(dataset.type, use.case.type, sport.domain) %>%
  mutate(
    ids    = paste(dataset.type, use.case.type, sport.domain, sep = "-"),
    labels = sport.domain,
    parents = paste(dataset.type, use.case.type, sep = "-")
  )

# Level 4: dataset
lvl4 <- infographic %>%
  distinct(dataset.type, use.case.type, sport.domain, dataset,
           dataset.name, description, methods.model, use.case, aim.dataset) %>%
  mutate(
    ids = paste(dataset.type, use.case.type, sport.domain, dataset, sep = "-"),

    labels = paste0(
      "<b>Dataset:</b> ", dataset, "<br>",
      "<b>Name:</b> ", dataset.name, "<br>",
      "<b>Description:</b> ", description, "<br>",
      "<b>Method:</b> ", methods.model, "<br>",
      "<b>Use Case:</b> ", use.case, "<br>",
      "<b>Aim:</b> ", aim.dataset
    ),

    parents = paste(dataset.type, use.case.type, sport.domain, sep = "-")
  )

treeD <- bind_rows(
  lvl1,
  lvl2,
  lvl3,
  lvl4
)

treeD
# Insert colours (one per hierarchy level); named to avoid masking base::levels()
level_colors <- c(
  "#E69F00", # Level 1 (dataset.type)
  "#009E73", # Level 2 (use.case.type)
  "#0072B2", # Level 3 (sport.domain)
  "#000000"  # Level 4 (dataset)
)

# Infer the level from the number of hyphen-joined id components
# (assumes the component values themselves contain no hyphens)
treeD <- treeD %>%
  mutate(
    level = case_when(
      parents == "" ~ 1,
      grepl("^[^-]+-[^-]+$", ids) ~ 2,
      grepl("^[^-]+-[^-]+-[^-]+$", ids) ~ 3,
      TRUE ~ 4
    ),
    colors = level_colors[level]
  )

plot_ly(
  treeD,
  type = "treemap",
  ids = ~ids,
  labels = ~labels,
  parents = ~parents,
  marker = list(colors = ~colors),
  textinfo = "label"  # "children" is not a valid treemap textinfo flag
)
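
One practical caveat with plot_ly treemaps: if any parents value has no matching entry in ids, plotly tends to render an empty chart rather than raise an error. A small consistency check before plotting can catch this early. The sketch below uses a toy hierarchy with the same hyphen-joined id scheme as treeD; check_hierarchy is an illustrative helper, not part of plotly.

```r
# Sketch: verify every non-root parent id exists among the node ids before
# handing the table to plot_ly(). `toy` mimics treeD's ids/parents scheme.
toy <- data.frame(
  ids     = c("synthetic", "synthetic-movement", "synthetic-movement-soccer"),
  parents = c("",          "synthetic",          "synthetic-movement"),
  stringsAsFactors = FALSE
)

check_hierarchy <- function(df) {
  orphans <- setdiff(df$parents[df$parents != ""], df$ids)
  if (length(orphans) > 0)
    warning("parent ids with no matching node: ", paste(orphans, collapse = ", "))
  length(orphans) == 0
}

check_hierarchy(toy)  # TRUE: every parent resolves to a node
```

Running this on treeD before plot_ly() would immediately surface any id typo introduced while pasting the level keys together.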